ABSTRACT



A method optimizes function evaluations performed by a
VLIW processor through enhanced parallelism by evaluating
the function by table approximation using decomposition
into a Taylor series.

3 Claims, 2 Drawing Sheets 
COMPUTER SYSTEM AND METHOD FOR
PARALLEL COMPUTATIONS USING TABLE 
APPROXIMATION 

This application is a continuation-in-part of and claims 
the benefit of U.S. application Ser. No. 09/220^06, filed
Dec. 24, 1998, now U.S. Pat. No. 6,363,405, the disclosure 
of which is incorporated by reference. This application 
claims the benefit of Application No. 60/068,738, filed Dec. 
24, 1997. 

FIELD OF THE INVENTION

The present invention relates to processors and computing 
devices and more particularly to compilers for optimized 
multiple function arithmetic execution units in a processor. 

BACKGROUND OF THE INVENTION 

Many practical applications require processing of very 
large amounts of information in a short period of time. 
Examples include weather forecasting, the design and
modeling of complex dynamic systems and others, which
applications frequently involve repeated estimation of modeling
functions over a set of input parameters.

One of the basic approaches to minimizing the time to 
perform such computations is to apply some sort of 
parallelism, so that tasks which are logically independent 
can be performed in parallel. This can be done, for example,
by executing two or more instructions per machine cycle,
i.e., by means of instruction-level parallelism. Thus, in a
class of computers using superscalar processing, hardware is 
used to detect independent instructions and execute them in 
parallel, often using techniques developed in the early 
supercomputers. 

Another, more powerful approach to exploiting
instruction-level parallelism is used by the Very Long
Instruction Word (VLIW) processor architectures, in which
the compiler performs most instruction scheduling and
parallel dispatching at compile time, reducing the operating
burden at run time. By moving the scheduling tasks to the
compiler, a VLIW
processor avoids both the operating latency problems and 
the large and complex circuitry associated with on-chip 
instruction scheduling logic. 

As known, each VLIW instruction includes multiple 
independent operations for execution by the processor in a 
single cycle. A VLIW compiler processes these instructions
according to precise conformance to the structure of the 
processor, including the number and type of the execution 
units, as well as execution unit timing and latencies. The 
compiler groups the operations into a wide instruction for 
execution in one cycle. At run time, the wide instruction is 
applied to the various execution units with little decoding. 
The execution units in a VLIW processor typically include 
arithmetic units such as floating point arithmetic units. An 
example of a VLIW processor that includes floating point 
execution units is described by R. K. Montoye, et al. in
"Design of the IBM RISC System/6000 floating point
execution unit," IBM J. Res. Develop., V 43 No. 1, pp.
61-62, January 1990. Additional examples are provided in
U.S. Pat. No. 5,418,975, as well as pending patent
applications Ser. Nos. 08/733,480, 08/733,479, 08/733,833,
08/733,834, 08/733,831 and 08/733,832, the content of which is
incorporated herein for all purposes.

While these processors are capable of performing a variety
of tasks adequately, it is perceived that the performance
of VLIW processors can be improved further by optimizing
them with respect to certain specialized but highly repetitive
tasks that are often used in practice, such as function
evaluation using decomposition into Taylor series.

SUMMARY OF THE INVENTION 

A novel method and system is presented for use with a
VLIW processor to optimize it for use in function evaluation.
In accordance with a preferred embodiment of the
present invention, a novel approach is presented to enhancing
parallelism in the evaluation of functions by table
approximation methods using decompositions into Taylor
series.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates in block diagram form the architecture
of a VLIW processor that can be used in a preferred 
embodiment of the present invention. 

FIG. 2 illustrates the steps in accordance with a preferred 
embodiment of the method of the present invention. 

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates in block diagram form the architecture
of a VLIW processor that can be used in a specific embodiment
of the present invention. The processor generally
comprises an instruction fetch unit 10; execution unit 90;
register file 30, the content of which is read in file read unit
20; instruction cache 40 and data cache 50; and state update
unit 80.

Generally, instruction fetch unit 10 acquires active
instructions via the I-cache 40. Execution unit 90 comprises
a set of function units 60. Example function units are integer
arithmetic logic units (ALU) and floating point addition and
multiplication units; also included are data access operation
units 70. Units may be pipelined into stages. Once instructions
are completed, their result is written in the state update unit 80
that writes back results in the register file 30.

The general architecture of a VLIW processor will not be 
discussed in further detail. Interested readers are directed to 
U.S. Pat. No. 5,418,975, and pending patent applications 
Ser. Nos. 08/733,480, 08/733,479, 08/733,833, 08/733,834,
08/733,831 and 08/733,832, the disclosures of which are
incorporated by reference herein. As known in the art, in a
VLIW architecture, the very long instruction words present
the scripts for the function units to follow at execution time.
The level of parallelism desired in a particular application is
achieved using local and global scheduling that enables
optimum distribution of the workload among different
functional units.

In accordance with a preferred embodiment of the present
invention, overall improvement in processing speed in the
evaluation of certain functions is achieved by representing
each function as a series expansion around one or more
function argument values, preferably stored in a table, and
providing a fast parallel method of computing the expansion
series for the deviation dx from the stored value of the
argument.

More specifically, in accordance with the present
invention, parallel algorithms are provided for the fast
computation of functions, such as sqrt(x), cbrt(x) and ln(x),
by table approximation methods using decomposition into
Taylor series. The method of the present invention is
illustrated next in the example of fast parallel sqrt(x) function
computation.

With reference to FIG. 2, the first step of the method in a
preferred embodiment is to divide the range of argument
values for the approximation into n intervals. In many
practical applications this range can be assumed as

0.5<x<1.

Next, for each of the n intervals, the value of the function
at the center x0 of the interval is determined. For notational
simplicity, the index "i" of the interval is omitted. Thus, in
a preferred embodiment of the present invention, at run time
the function for any argument falling within the i-th interval
is evaluated as an approximation using series
expansion about the center x0 of the interval. The deviation
of the actual function argument from the x0 value is denoted
dx.

Next, to compute, for example, the sqrt(x) function, in
accordance with the present invention the following
expression is used:

sqrt(x)=sqrt(x0)+sqrt(x0)/x0^m*(a0*dx*x0^(m-1)+a1*dx^2*x0^(m-2)+ . . . +a(m-2)*dx^(m-1)*x0+a(m-1)*dx^m); (Eqn. 1)

The values of sqrt(x0) and sqrt(x0)/x0^m are computed
and stored in a table. The coefficients a0, a1, . . . , a(m-1),
which are obtained from the function decomposition into
Taylor series, are similarly stored in memory.

The remaining part of Eqn. 1 is a polynomial of the form

sum for i = 0 to m-1 of a(i)*dx^(i+1)*x0^(m-1-i)

which can be computed conveniently with the use of dif- 
ferent parallel computation schemes, as known in the art. 
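
As a minimal sketch of the table method described above, the following Python code builds a table of sqrt(x0) and sqrt(x0)/x0^m for n interval centers and evaluates Eqn. 1 for an argument in 0.5<x<1. The function names, the choice n=16 and m=4, and the use of binomial coefficients for a(i) are illustrative assumptions, not details taken from this disclosure. The summands are computed serially here; on a VLIW processor each summand could be assigned to a separate function unit.

```python
import math

def taylor_coeffs_sqrt(m):
    """Assumed coefficients a(0..m-1): sqrt(1+u) - 1 = a0*u + a1*u^2 + ..."""
    coeffs = []
    c = 1.0  # binomial coefficient C(1/2, 0)
    for i in range(1, m + 1):
        c *= (0.5 - (i - 1)) / i  # C(1/2, i) from C(1/2, i-1)
        coeffs.append(c)
    return coeffs

def build_table(n, m, lo=0.5, hi=1.0):
    """For each of n intervals store (x0, sqrt(x0), sqrt(x0)/x0^m)."""
    width = (hi - lo) / n
    table = []
    for i in range(n):
        x0 = lo + (i + 0.5) * width  # interval center
        r = math.sqrt(x0)
        table.append((x0, r, r / x0 ** m))
    return table

def table_sqrt(x, table, coeffs, lo=0.5, hi=1.0):
    """Evaluate Eqn. 1 for x in [lo, hi] using the stored constants."""
    n, m = len(table), len(coeffs)
    i = min(int((x - lo) / (hi - lo) * n), n - 1)  # interval index
    x0, r, scale = table[i]
    dx = x - x0
    # Polynomial part of Eqn. 1: sum of a(k) * dx^(k+1) * x0^(m-1-k).
    poly = sum(a * dx ** (k + 1) * x0 ** (m - 1 - k)
               for k, a in enumerate(coeffs))
    return r + scale * poly

table = build_table(16, 4)
coeffs = taylor_coeffs_sqrt(4)
approx = table_sqrt(0.75, table, coeffs)
```

The summands in `poly` do not depend on one another, which is the property that makes the polynomial part suitable for parallel evaluation.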

The following example illustrates a parallel computation 
scheme for the cbrt function: 

[equation illegible in the source]

where the total number of required arithmetic operations
K=29, and the length of the critical path for the computation
of the function evaluation is T=max(5mul+2add, 4mul+4add).
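
To illustrate how a polynomial evaluation can be divided into independent tasks with a shorter critical path, the sketch below uses the well-known even/odd split of a polynomial. This is a swapped-in textbook technique offered as an assumption-laden example, not the specific scheme of this disclosure; the function names are hypothetical.

```python
def horner(coeffs, t):
    """Serial reference: each step depends on the previous one."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc

def even_odd_split(coeffs, t):
    """p(t) = E(t^2) + t * O(t^2), where E and O are independent tasks."""
    t2 = t * t
    even = horner(coeffs[0::2], t2)  # task A: even-index coefficients
    odd = horner(coeffs[1::2], t2)   # task B: does not depend on task A
    return even + t * odd

# Degree-7 example with arbitrary illustrative coefficients.
c = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75]
serial = horner(c, 0.3)
split = even_odd_split(c, 0.3)
```

With the two halves scheduled on separate function units, the longest dependency chain for a degree-7 polynomial is roughly 5 multiplications and 4 additions, compared with about 7 multiply-add steps for the purely serial form.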

It can be appreciated that formulae similar to Eqns. 1 and
2 can easily be derived for a number of additional functions,
such as the cubic root cbrt and the ln functions. These
functions lend themselves to straightforward expansion in a
Taylor series. Once the expansion is available, the values of
the function at the x0 point and the powers of x0, as required
in the expansion, can be obtained and stored. The remaining
part of the series expansion lends itself to parallel computing
that greatly reduces the time required for the function
evaluation.

In accordance with a preferred embodiment of the present
invention, the number of intervals n into which the range of
function arguments is divided is determined by constraints
on the size of the utilized tables of constants and the required
accuracy. The constant m is found in a preferred embodiment
on the basis of the size of the intervals, i.e., n, and
the required accuracy of the computations. The accuracy of
the computation can be determined using the expressions for
the error in Taylor series expansions.
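
One possible way to pick m for a given n, sketched here for sqrt(x) on 0.5<x<1, is to bound the truncated tail of the series and increase m until the bound meets a tolerance. The bound used below (first omitted binomial term divided by 1-u, valid because the coefficient magnitudes decrease) and the names `truncation_bound` and `smallest_m` are assumptions made for illustration.

```python
import math

def truncation_bound(n, m, lo=0.5, hi=1.0):
    """Upper bound on the error of Eqn. 1 with m coefficients, n intervals."""
    half = (hi - lo) / (2 * n)   # largest |dx| within an interval
    x0_min = lo + half           # smallest center gives the largest dx/x0
    u = half / x0_min
    c = 1.0
    for i in range(1, m + 2):
        c *= (0.5 - (i - 1)) / i  # after the loop: C(1/2, m+1)
    # Tail of the series is at most |c| * u^(m+1) / (1 - u), scaled by
    # sqrt(x0) <= sqrt(hi).
    return math.sqrt(hi) * abs(c) * u ** (m + 1) / (1.0 - u)

def smallest_m(n, tol):
    """Smallest m whose bound is within tol."""
    m = 1
    while truncation_bound(n, m) > tol:
        m += 1
    return m

m_needed = smallest_m(16, 1e-8)
```

Doubling n shrinks u and therefore lowers the m needed for the same tolerance, which is the table-size versus computation tradeoff described above.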

Finally, in accordance with a preferred embodiment,
reduction of the argument to the required approximation
range and obtaining of the final result after the computations
in the interval are performed in a traditional way.
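
A minimal sketch of one traditional reduction for sqrt(x), assuming IEEE-style floating point and Python's standard `math.frexp` and `math.ldexp`: the mantissa returned by `frexp` already lies in the range 0.5 to 1, and the exponent is halved afterwards. The function name and the `core_sqrt` parameter (a stand-in for the table-based evaluation) are hypothetical.

```python
import math

def sqrt_with_reduction(x, core_sqrt=math.sqrt):
    """Reduce x to a mantissa in [0.5, 1), apply core_sqrt, then rescale."""
    if x < 0.0:
        raise ValueError("negative argument")
    if x == 0.0:
        return 0.0
    mant, exp = math.frexp(x)     # x = mant * 2**exp with 0.5 <= mant < 1
    root = core_sqrt(mant)        # evaluation inside the approximation range
    if exp % 2:                   # odd exponent: pull out one factor of 2
        root *= math.sqrt(2.0)
        exp -= 1
    return math.ldexp(root, exp // 2)  # multiply by 2**(exp/2)

value = sqrt_with_reduction(10.0)
```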

Although the present invention has been described in
connection with the preferred embodiments, it is not
intended to be limited to the specific form set forth herein,
but on the contrary, it is intended to cover such
modifications, alternatives, and equivalents as can be
reasonably included within the spirit and scope of the invention
as defined by the following claims.

What is claimed is: 

1. A computer method for compiling function evaluation 
on a parallel computing system comprising the steps of: 

dividing up the range of function arguments into n values, 
and determining the center x0 for each interval;

determining the value of the function at x0, the m-th
power of x0 and the first m coefficients a(i) of the Taylor
series expansion of the function and storing said values
in a memory, where m is a number selected on the basis
of the desired accuracy of the computation;

for a given argument x positioned at a distance dx from x0,
evaluating a polynomial of the type 

sum for i = 0 to m-1 of a(i)*dx^(i+1)*x0^(m-1-i)

to compute summands of said polynomial in parallel; 
and 

combining the values stored in the memory and the
evaluation of said polynomial so as to provide an 
evaluation of the function at the x argument value. 

2. The method of claim 1 further comprising the steps of:

dividing up the evaluation of a polynomial into two or
more independent tasks;

determining the longest independent task, defined as a
critical path for the polynomial evaluation;

minimizing the processing time for the critical path by
changing the operations order; and

scheduling a sequence of tasks among said plurality of
function units, wherein completion of all tasks results
in the polynomial evaluation.

3. The method of claim 2 wherein changing the operations
order comprises replacing multiplication operations with
additions in the critical path.

ABSTRACT



A technique for the evaluation of a general continuous
function f(x) is presented, and the design of an interpolating
memory, an implementation of the technique, is
described. The technique partitions the domain of f(x)
into segments, and defines an interpolating (or approximating)
function for each. The implementation is a
memory subsystem that holds the parameters of the
approximating functions and yields an interpolated
function value on each read reference. Polynomial
interpolating functions are considered in particular.
Hardware requirements (memory and computational logic)
are analyzed in terms of the required precision. It is
shown that as long as f(x) has d+1 derivatives, and d is
the degree of the interpolating polynomial, d+1 additional
bits of precision of the computed f(x) are obtained
for each additional address bit used in the interpolating
memory. This establishes a tradeoff between memory
and computational logic, which can be exploited in the
design of a unit for a specific function, for any precision
requirement. Furthermore, a single unit may be designed
for any class of functions that have the required
derivatives. Two examples of implementations for particular
functions are presented.
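
The technique can be sketched in software as follows, assuming a linear interpolant (d=1), an n-bit fixed-point argument on the range 1 to 2, and hypothetical names `build_memory`, `read` and `max_error`. The high k bits address a table of per-segment coefficients and the remaining low bits form the polynomial argument t. This is an illustrative model only, not the hardware design itself.

```python
import math

def build_memory(f, k, lo=1.0, hi=2.0):
    """One word per segment: (c0, c1) of the interpolant c0 + c1*t."""
    seg = (hi - lo) / (1 << k)
    words = []
    for i in range(1 << k):
        a = f(lo + i * seg)
        b = f(lo + (i + 1) * seg)
        words.append((a, b - a))  # line through the segment endpoints
    return words

def read(words, k, arg_bits, n):
    """Look up by the high k bits; interpolate with the low n-k bits."""
    addr = arg_bits >> (n - k)
    low = arg_bits & ((1 << (n - k)) - 1)
    t = low / (1 << (n - k))      # t in [0, 1) within the segment
    c0, c1 = words[addr]
    return c0 + c1 * t

def max_error(f, k, n=12, lo=1.0, hi=2.0):
    """Largest absolute error over all n-bit arguments."""
    words = build_memory(f, k, lo, hi)
    worst = 0.0
    for bits in range(1 << n):
        x = lo + bits / (1 << n) * (hi - lo)
        worst = max(worst, abs(read(words, k, bits, n) - f(x)))
    return worst

ratio = max_error(math.log2, 4) / max_error(math.log2, 5)
```

For log2 x with d=1, each extra address bit should cut the maximum error by roughly a factor of four, that is, about d+1 = 2 additional bits of precision, which is the relationship stated above.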

5 Claims, 5 Drawing Sheets
INTERPOLATING MEMORY FUNCTION EVALUATION

BACKGROUND OF THE INVENTION

Several computer applications, digital signal processing in
particular, require the repeated evaluation of a few specific
functions. Usually, function values are obtained by computing
the terms of a series to the required degree of precision.
But this may be unacceptably slow, especially when there are
real-time constraints. Alternative solutions are either
table-lookups or special-purpose arithmetic units. But the
table-lookup approach is useful only for limited-precision
applications, since the table size grows exponentially with
the argument precision, and special-purpose hardware designed
for each required function may be expensive and inflexible.

A number of previous works have reported on the design of
special-purpose hardware for the evaluation of particular
functions. The base-two logarithm and exponential have been
the functions most considered, because of the usefulness of
the logarithmic transformation in multiplication, division,
and exponentiation.

The work of Mitchell is a fundamental exposition of the idea
of implementing the log function in hardware. Any positive
number y can be written in the form y=2^k x, where k is an
integer and x is between one and two. Mitchell proposes using
the straight line x-1 as an approximation to log2 x for
1≤x≤2. But the straight-line approximation has a maximum
absolute error of 0.086 in the interval. This can lead to an
error of 11.1 percent for multiplication implemented by
logarithms.

Combet et al. propose using several straight-line segments
for the approximation of log2 x. The coefficients of the
linear segments are selected for efficient implementation in
hardware as well as reduction of the maximum error. The study
results in a hardware design that uses four segments,
reducing the maximum error to 0.008.

Hall, Lynch, and Dwyer also propose the piecewise linear
approximation to the logarithm, as well as to the
exponential. Their analysis is based on the minimization of
the squared error. Numerical results are shown for
approximations by up to eight segments. The maximum error of
the least-squares linear approximation to log2 x is shown to
be 0.065 for one segment and 0.0062 for four segments
(compare to the cases above). For eight segments, the error
is 0.00167, approximately a four-fold reduction over the
four-segment case. The use of the log-exp transformation is
shown for a digital filter, requiring 6-bit precision.

A work by Marino considers the use of two second-degree
polynomial segments to approximate log2 x. The computation of
x^2 is approximated in order to reduce it to adding and
shifting operations. A maximum absolute error of 0.0040 is
achieved for log2 x.

Brubaker and Becker analyze the evaluation of logarithms and
exponentials by ROM table-lookup without interpolation. They
consider multiplication by direct table-lookup and by the
addition of logarithms. Their highest precision example is a
multiplication with an error of 0.1 percent, in which the
operands are 11 bits and the product 10 bits. In this case,
the memory required for multiplication via logarithms is
smaller by a factor of 50 than that required for direct
table-lookup.

Several studies have shown the effectiveness of the logarithm
form of number representation, both for fast and significant
computation, and for efficient use of storage. But the
practical usefulness of logarithmic representation schemes
depends critically on efficient conversions through the
logarithm and exponential functions.

A recent work by Lo and Aoki shows how these functions can be
evaluated quickly through the use of a programmable logic
array (PLA). Their scheme is to use the single-segment linear
approximation for log2 x for 1≤x≤2, but then to add
error-correcting values obtained from a PLA. The error
corrections, truncated to the precision required of the
result, are found to be constant in intervals of x. Economy
in the required PLA size is obtained by encoding it to
provide corrections for each group of x values. The speed of
this technique is significant: the result is obtained after
only two [illegible] levels in the PLA, followed by one
addition. [illegible] authors have also demonstrated
iterative [illegible] for the evaluation of the logarithm.
These [illegible] implemented with a smaller hardware
[illegible] correspondingly slower.

SUMMARY OF THE INVENTION

This invention provides a function evaluator which
[illegible] interpolating memory for evaluating functions. An
argument is expressed in digital bits. High order bits
[illegible] address a memory for selecting coefficients. Low
order bits are used in combinational logic to supply powers
which multiply the coefficients. An adder sums the products
of the multipliers, and the summation is the function
evaluation.

Floating point arguments with exponents and mantissas in
digital form are used with the system. Low order bits of the
exponent or high order bits of the mantissa or both are
supplied to the decoder of a set-associative memory to select
a set. An associative search is made by a function
representation and high order bits of the exponent.

A memory word selected in the search contains an exponent and
coefficients. The coefficients are supplied to an evaluator
which is similar to that previously described, and the
summation from that evaluator and the exponent are supplied
to a normalizer. The output of the normalizer is the function
evaluation.

The invention is an interpolating memory, a digital
electronic device, capable of quickly evaluating mathematical
functions. In its operation, it evaluates a selected
polynomial approximation to the desired function. The
hardware consists of:

a) A memory bank containing the coefficients of the
approximating polynomials, which is addressed by the
argument x;

b) Combinational logic for computing the values t^2,
t^3, . . . t^d;

c) A set of parallel multipliers for computing the
polynomial; and

d) An adder to sum the terms of the polynomial.

A second embodiment accepts floating-point arguments and
produces floating-point function values, over the domain of
all floating-point numbers. It can hold the coefficients of
several functions at once.

The second embodiment has the following extensions:

a) The input argument x is in floating-point form with
exponent E and mantissa M.

b) A function identification f is provided as an input.
c) A set-associative memory is used to select the memory word
containing the polynomial coefficients. The associative
memory set is selected by k bits taken from both the
low-order bits of E and the high-order bits of M.

d) An associative search is made, with f and the high-order
bits of E constituting the search key, within the selected
set of associative memory registers to select the memory word
containing an exponent and the polynomial coefficients.

e) Polynomial coefficients are multiplied with powers of the
low order bits of M.

f) The memory word that holds the polynomial coefficients
also holds a floating-point exponent.

g) A floating-point normalization takes place after the
polynomial evaluation.

The present invention provides the basic concept of the
interpolating memory, the parallel hardware evaluation of
polynomial segments selected by bits of the argument. The
floating-point embodiment includes details that make it more
useful.

The present invention provides a generalized function
evaluation unit that combines table-lookup and arithmetic
logic. A tradeoff between memory size and amount of
combinational logic allows a designer for a particular
application to select among various configurations. The
generalized unit can yield values for a broad class of
functions.

An interpolating memory apparatus includes a memory bank
containing coefficients of approximating polynomials,
combinational logic for computing powers of a given variable,
plural multipliers for multiplying the powers of the variable
and the polynomial coefficients, and an adder connected to
outputs of the multipliers for summing the products of the
powers of the variable and the coefficients of the
polynomial. The sum is the functional evaluation.

The interpolating memory function evaluation includes the
parallel evaluation of polynomial segments selected by bits
of the function argument.

The preferred method uses bits of a function argument to
address a memory holding coefficients and uses other argument
bits in combinational logic in parallel with the addressing
of the memory. The invention multiplies in parallel
coefficients from the memory and sequential powers of the
argument bits obtained from the combinational logic and
combines the products in an adder.

The function evaluation includes using a binary
representation of a number, addressing a memory with one part
of the representation and supplying another part of the
representation to combinational logic units in parallel with
the memory, supplying coefficients from the memory to
parallel multipliers and supplying values from the
combinational logic to the multipliers, obtaining products,
adding the products and thereby producing a function
evaluation.

When used with a floating point argument, a binary function
identification is supplied to a set-associative memory. The
exponent and the mantissa of a floating point argument are
expressed in binary terms. Addressing the associative memory
with low order bits of the exponent or high order bits of the
mantissa or both selects a set of associative memory
registers. Supplying the function identification and the high
order bits of the exponent to the set-associative memory
makes an associative search in the selected set of
associative memory registers, and, upon finding a match,
selects a word of polynomial coefficients and an exponent as
outputs from the memory. Coefficients of the word and lower
order bits of the mantissa are supplied in parallel to a
polynomial evaluator, as previously described. The exponent
from the word and the output of the polynomial evaluator are
supplied to a floating-point normalization unit. An output of
the floating-point normalization unit provides the function
evaluation.

In a preferred method, an exponent and mantissa register or
bus receives an input floating-point argument. A function
representation register or bus receives a function
representation. A set-associative memory has a function input
connected to the function register or bus and has an exponent
input connected to receive high order bits from the exponent
register or bus. A decoder connected to the exponent and
mantissa register or bus section receives either or both of
low order bits from the exponent and high order bits from the
mantissa. According to that input, the decoder of the
set-associative memory selects a memory set. An associative
search is made in that set with the function input bits and
the high order bits from the exponent register or bus. The
search selects from the set-associative memory a word having
a floating-point exponent and having polynomial coefficients.
A polynomial evaluator multiplies the polynomial coefficients
with values consisting of the low order bits from the
mantissa register or bus raised to integer powers. A
floating-point normalization unit is supplied with the
floating-point exponent from the memory and with the summed
output of the polynomial evaluator and produces a function
evaluation.

In one form of the apparatus a floating-point interpolating
memory has a function binary representation input and a
floating-point argument register. The floating-point argument
register has an exponent section and a mantissa section. A
set-associative memory has an address input and an
associative search key input. The function identification and
high order bits in the exponent section are connected to the
search key input of the set-associative memory. An address
decoder in the set-associative memory receives low order bits
from the exponent section, high order bits from the mantissa
section, or both. A memory unit is connected to the output of
the set-associative memory. The memory unit holds a
floating-point exponent and polynomial coefficients. A
polynomial evaluator, as previously described, is connected
to plural polynomial coefficient outputs from the memory
unit. The polynomial evaluator has an input connected to
receive low order bits from the mantissa section. A
floating-point normalization unit receives an output from the
polynomial evaluator and an output of the floating-point
exponent output of the memory unit. The floating-point
normalization unit has an output which supplies the function
evaluation.

A preferred method of evaluating floating-point function
values includes inputting a function identification f,
inputting a floating-point argument x in a binary
representation having exponent E and mantissa M, supplying
the function identification f input to a set-associative
memory, supplying high order bits of E to a set-associative
memory, supplying low order bits of E and/or high order bits
of M to a set selector for selecting a memory set from the
set-associative memory, making an associative search in the
selected set with f and the high order bits of E, selecting a
memory word containing a floating point exponent and
polynomial coefficients from the selected associative memory
set, supplying parallel outputs of the polynomial
coefficients to a polynomial
evaluator, supplying low order bits of M to the polynomial
evaluator, supplying the floating-point exponent to a
floating-point normalization unit, supplying an output of the
polynomial evaluator to the floating-point normalization unit
and supplying a function evaluation output from the
normalization unit.

Broadly stated, the interpolating memory function evaluation
method of the present invention comprises accessing a memory
unit by using specified bits of a function argument for the
memory unit address, obtaining parameters of an approximating
function from the memory unit, and evaluating, using other
specified bits of the argument, the specific approximating
function whose parameters were obtained from the memory unit.

The preferred accessing comprises using k bits of the
function argument to address the memory unit. The memory unit
contains polynomial coefficients. The evaluating comprises
evaluating with fixed combinational logic the degree-d
polynomial whose coefficients are obtained from the memory
unit, using as the polynomial argument the bits of the
function argument of lower order than the k bits used to
address the memory unit.

The preferred evaluation of the polynomial is performed
using, as an input to the polynomial evaluator, the value t
comprised of bits of the function argument of lower order
than the k bits used to address the memory unit, computing
the quantities t^2, t^3, . . . t^d in fixed combinational
logic, computing the terms of the polynomial in a set of
multipliers operating in parallel, whose inputs are the
coefficients obtained from the memory unit and the powers of
t, and adding the terms of the polynomial.

The system accepts a floating-point function argument with
exponent field E and mantissa M, and uses k function argument
bits, including zero or more low order bits of E and zero or
more high order bits of M, as an address of a register set in
a set-associative memory. An associative search is performed
in the associative memory set selected by the address, using
the high order bits of E as the associative search key,
obtaining from the memory location associated with a
successful match in the associative memory a floating-point
exponent of a function value and the d+1 polynomial
coefficients. The polynomial specified by the coefficients is
evaluated with the polynomial argument specified as the
low-order bits of M. The function evaluation is obtained by
normalizing the floating-point value consisting of the
exponent obtained from the memory and the mantissa obtained
as a result of the polynomial evaluation.

The particular function to be evaluated is supplied as a
function identification f, and the associative search key is
comprised of both f and the high order bits of E.

These and other and further objects and features of the
invention are apparent in the disclosure which includes the
above and ongoing specification and claims and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of an interpolating
memory function evaluation.

FIGS. 2A, 2B and 2C are graphic representations of error of
log2 x with linear interpolation.

FIG. 3 is a representation of maximum error (log scale)
versus address bits k.

FIG. 4 is a schematic representation of parameters for
hardware requirements.

FIG. 5 is a schematic representation of a floating point
interpolating memory.

DETAILED DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically represents an interpolating memory
generally indicated by the numeral 1. The argument 3 is input
to a register 5 as high order bits and lower order bits. A
number k of high order bits from register 5 are used as
address i of the memory unit 7. The memory word is composed
of d fields, each of which holds a polynomial coefficient 9,
11, 13 . . . 15. Low order bits from register 5 represent the
variable t, of which increasing powers are supplied by
combinational [illegible] the appropriate coefficients
[illegible] to outputs 23 [illegible] parallel multipliers
35, 37 . . . 39. The multipliers are supplied with powers of
t from the combinational logics [illegible] 47 . . . 49 of
the multipliers [illegible] coefficient 23 from the
[illegible] directly to adder 41. The output 44 [illegible]
evaluation. The [illegible] generally represented by the
numeral [illegible] as later will be described [illegible] is
a schematic representation of the required hardware. The
numbers n-1, [illegible] represent the bit positions in
register 5. 10 represents one field of the memory word. 20 is
an example of combinational logic units. 30 is an example of
multipliers. Rectangles 22, 32 and 42 are representations of
bit positions in the respective words on the busses 24, 34
and 44.

FIG. 5 represents a floating point interpolating memory 59
having a function identification input 51 and an input 53 for
argument x, expressed as an exponent 57 and a mantissa 59, in
an input register 55. A set-associative memory 61 is
connected to a function identification bus 63 and an exponent
bus 65, which provides high order bits 67 from the exponent
57 in register 55. The decoder 69 of the set-associative
memory 61 receives low order exponent bits 71 and high order
mantissa bits 73. The outputs 75 of the decoder select a
particular set 76 of associative memory registers. The
associative search is made with the function input 63 and the
high order bits of the exponent input 65. If the selected set
contains values of f on bus 63 and the high order exponent
bits on bus 65, the corresponding memory word 71 is selected.
The memory word 71 holds a floating point exponent 73 of the
identified function and also holds the coefficients 9, 11, 13
and 15, as shown in FIG. 1. Parallel outputs 75 through 77 of
the coefficients in word 71 are supplied to a polynomial
evaluator 46 (as shown in FIG. 1), which also receives low
order bits on bus 81 from the mantissa M in register 55 (5 in
FIG. 1). The result supplied to output 83 is the summation of
the products of the coefficients and the powers of the low
order bits of M. The floating point normalization occurs in
normalization unit 85, which also receives input bus 87 of
the floating point exponent 73. Result 89 is the floating
point representation of the evaluated function.

OUTLINE OF THE DESCRIPTION

In this study, we consider a general technique that combines
the table-lookup (memory) and computational (arithmetic
logic) approaches to function evaluation. The hardware
realization of the technique will be
called an interpolating memory. It is in essence a memory unit that
contains a subset of the required values of f(x) together with the means
of interpolating additional function values on each read reference. We
study the particular case of interpolation by polynomial segments.

The analysis shows how the error, and hence the precision of the result,
varies with the interpolating memory size and with the degree of the
interpolating polynomial.

The main result is that the precision of the computed f(x) is a linear
function of the number of address bits. If an interpolating polynomial
is of degree d, each additional address bit provides d + 1 additional
bits of precision of the f(x) value. This unifies and generalizes the
earlier results on table-lookup and linear and second-degree
approximations. It leads to a convenient rule by which a design for a
given degree of precision may be chosen by trading off memory and
arithmetic logic.

Furthermore, this result is shown to hold for any function that has
d + 1 derivatives. Hence, if the interpolating memory is implemented
with RAM, it can be used for x^(-1) (obviating the need for division
hardware), for trigonometric functions, for general functions computed
by Fourier series, etc., as well as for the logarithm and exponential. A
generalized design must have bit widths in the memory and arithmetic
logic that are sufficient for each function. A method for determining
these hardware requirements is presented.

In Section II, the design of the generalized interpolating memory is
outlined. In Section III, I discuss the error and precision of f(x), and
show the error characteristic (in terms of the number of address bits)
of the log, exp, and sine functions. In Section IV, I show that the
upper bound of the error holds in theory, for all functions that have a
sufficient number of derivatives. In Section V, I derive the total
hardware requirements - memory size and data width of the memory and
arithmetic units - for a given precision. In Section VI, two particular
examples of interpolating memory design are presented. Section VII
compares the timing and hardware requirements to other techniques. Our
conclusions are in Section VIII.

II. FUNCTION EVALUATION BY APPROXIMATING SEGMENTS

Consider the evaluation of a function f(x) over the evaluation interval
a ≤ x < b, where a = c·2^n and b = (c + 1)·2^n, for arbitrary integers c
and n. If the desired interval for an application does not meet these
conditions, the smallest subsuming interval that does may be chosen. Let
x_I = x - a. The binary representation of x in [a, b) may be partitioned
as follows: the higher order bits (bit n and higher) represent the
constant c·2^n, and the lower order bits represent x_I in [0, 2^n). If
function values are to be obtained at 2^m points in the evaluation
interval, x_I must be specified with a precision of at least m bits.

In the technique of the interpolating memory, the evaluation interval is
partitioned into 2^k segments, and an approximating function is
specified for each segment. The most significant k bits of x_I identify
the segment. In an implementation, these bits will be used to address
the storage unit holding the parameters of the approximating function.
They will be called the address bits. The next m - k bits of x_I specify
the point within the approximating segment at which the evaluation is to
take place. These will be called the interpolating bits.

Polynomial Approximation Segments

Consider the specific case of function evaluation in which the
approximating functions are polynomials. Let Δx = (b - a)·2^(-k) be the
length of an approximating segment. For i = 0, ..., 2^k - 1, let segment
i be defined by:

    \{\, x \mid a + i\,\Delta x \le x < a + (i+1)\,\Delta x \,\}.

Let f_i(t) = f(a + iΔx + t), and let g_i(t) be an approximation to
f_i(t) for 0 ≤ t ≤ Δx. Suppose g_i(t) is a polynomial of degree d,

    g_i(t) = \sum_{h=0}^{d} a_h(i)\, t^h.        (1)

Then the interpolating memory unit must hold at address i the
coefficients a_0(i), ..., a_d(i).

The hardware organization required for the straightforward interpolation
by polynomial of degree d is shown in FIG. 1. The k address bits
constitute the segment number i, and are used to address the memory unit
holding the coefficients. The memory size is 2^k words. The m - k
interpolating bits constitute the variable t. The t^h values,
h = 2, ..., d, are obtained in combinational logic in parallel with the
memory read. The products a_h(i)t^h are obtained in parallel multiply
units, the bit widths of which are usually less than that of the result.
The products are combined in the final adder.

The time required for a function evaluation is the sum of the memory
access time, the time for the widest multiplication, and the delay in
the adder. The exact timing values, and the quantitative hardware
requirements (the bit widths of the coefficients and the multipliers),
are determined by the precision required of the result. However, a given
precision of the result can be obtained in various combinations of d and
k. The precision is determined by the error of the polynomial
approximation.

III. ERROR OF THE POLYNOMIAL APPROXIMATION

Let g_i(t) be a polynomial of degree d approximating f_i(t) over segment
i. Let

    e_i(t) = | f_i(t) - g_i(t) |.

The error of a given type of approximating function is defined to be the
maximum absolute error over all the segments of the evaluation interval.
For 2^k segments, let

    E_k = \max_{0 \le i < 2^k,\; 0 \le t \le \Delta x} e_i(t).        (3)

E_k is actually an upper bound on the error of the device, since it is
the maximum error of the continuous approximating functions, and the
interpolating memory evaluates these functions only at discrete points.

Precision

The precision of a quantity is the number of bits in its binary
representation. If bounds on the error of a quantity are known in the
design phase, the precision can be specified so as not to introduce
additional error, and not
to implement meaningless bits. In considering the implementation of a
quantity whose error can be bounded, I will use the term "precision" to
refer to the most reasonable implementation of the quantity. The
fixed-point precision of the result will be said to be to bit q if the
error is not more than the value of bit position q. That is, if

    E_k \le 2^{q} \quad\text{or}\quad q \ge \log_2 E_k.

The complete precision is determined by the msb position, which depends
on the maximum possible magnitude of the quantity, as well as by the
fixed-point precision.

Precision as a Function of Number of Address Bits

If k bits are used to address the memory unit holding the coefficients,
the function is evaluated over 2^k segments. As an example, consider the
evaluation of log2 x, for 1 ≤ x < 2, by means of the linear Lagrange
interpolation. The Lagrange interpolating polynomial of degree d
coincides with the function value at d + 1 evenly spaced points in the
approximating segment. FIG. 2 shows E_k for the cases k = 0, 1, 2.

E_k, for Lagrange polynomial approximations to log2 x, is plotted on a
logarithmic scale in FIG. 3. Also plotted are the error characteristics
for the functions 2^x, evaluated for 0 ≤ x < 1, and sin(x), evaluated
for 0 ≤ x < 2. The cases d = 1, 2, and 3 are given for each function.

The error characteristic in each case approaches a linear asymptote on
the logarithmic scale. The slope of this asymptote is the same for each
of the functions, and depends only on the value of d. For linear
approximations, E_k is reduced by a factor of four for each increment of
k on the linear portion of the characteristic. For the second and third
degree approximations, the error is reduced by factors of 8 and 16,
respectively, for each additional address bit. In general, it seems that
for an approximating polynomial of degree d, the precision of the result
(as determined by the error) is increased by d + 1 bits for each added
address bit. In the next section, this result will be shown to hold for
the Lagrange polynomial approximation to any function that has d + 1
derivatives.

The Lagrange polynomial is easy to compute and is amenable to analysis,
but other polynomial approximations yield smaller errors. If a
least-squares polynomial is used, the resulting plot of E_k is virtually
the same as FIG. 3, except that each curve is displaced downward,
representing a further reduction of E_k by a factor of 1.5 to 2.

IV. ERROR BOUND FOR THE LAGRANGE APPROXIMATION

The existence of an error bound for the Lagrange interpolating
polynomial is well known. Here, we extend the result to show the error
bound as a bound on precision, and the bound on precision as a function
of the number of address bits of the interpolating memory.

Let g_i(t) be the Lagrange interpolating polynomial of degree d
approximating f_i(t) over interval i, 0 ≤ t ≤ Δx. Then
g_i(t_j) = f_i(t_j) at the evenly spaced points t_j = jΔx/d, for
j = 0, ..., d. It is shown in 02 that as long as f_i(t) has d + 1
derivatives, the error of the degree-d Lagrange approximation is bounded
as follows:

    \max_{0 \le t \le \Delta x} e_i(t) \le \frac{1}{(d+1)!}\,
        \max_{0 \le t \le \Delta x} \Bigl| \prod_{j=0}^{d} (t - t_j) \Bigr| \,
        \bigl| f_i^{(d+1)}(t) \bigr|_{\max}        (5)

where | f_i^{(d+1)}(t) |_max is the maximum magnitude of the (d + 1)th
derivative of f_i(t) for 0 ≤ t ≤ Δx.

With the variable change y = t/Δx, the product term of (5) may be
rewritten:

    \max_{0 \le t \le \Delta x} \Bigl| \prod_{j=0}^{d} (t - j\,\Delta x/d) \Bigr|
        = \max_{0 \le y \le 1} \prod_{j=0}^{d} \bigl| (y - j/d)\,\Delta x \bigr|
        = \Delta x^{\,d+1} \max_{0 \le y \le 1} \prod_{j=0}^{d} | y - j/d |.        (7)

Then (5) becomes

    \max_{0 \le t \le \Delta x} e_i(t) \le \frac{\Delta x^{\,d+1}}{(d+1)!}\,
        \max_{0 \le y \le 1} \prod_{j=0}^{d} | y - j/d | \;
        \bigl| f_i^{(d+1)}(t) \bigr|_{\max}.        (8)

The maximum error over the entire interval is the maximum of (8) over
i = 0, ..., 2^k - 1. Since Δx = 2^(-k)(b - a), one may write:

    E_k \le 2^{-(d+1)k}\, \frac{(b-a)^{d+1}}{(d+1)!}\,
        \max_{0 \le y \le 1} \prod_{j=0}^{d} | y - j/d | \;
        \max_{a \le x \le b} \bigl| f^{(d+1)}(x) \bigr|.        (9)

Since the maximum error of the degree-d Lagrange polynomial
approximation has a bound that is proportional to 2^(-(d+1)k), the
allowable precision of the result (lsb position) is proportional to
-(d + 1)k. Furthermore, this characteristic will hold for any function
that has d + 1 derivatives over the domain of interest. In the next
section, I show how the hardware requirements are determined from the
specified precision of the result. This will allow the design of
interpolating memories, implemented in RAM, that may be used for a broad
class of functions.

V. HARDWARE REQUIREMENTS

A method is presented for determining the widths of the data paths for
the computation of f(x), given a specification of the maximum error (or
the required precision) of the result. The data path widths determine
the hardware required for the arithmetic logic and the memory of
coefficients. Signed magnitude representation is assumed for all
quantities. One's- and two's-complement representations will have almost
exactly the same hardware requirements. The storage and manipulation of
sign bits is not explicitly considered.

It is assumed that f(x) is to be evaluated over [a, b), where
b - a = 2^n, and that a is a multiple of 2^n. Only bits less significant
than bit n in the binary representation of x vary over the domain of
interest. Let bit n - m be the lowest order implemented bit of x. It is
also assumed that the argument x has no error: all bits of lower order
than bit n - m are identically zero. The notations used for the relevant
bit positions are shown in FIG. 4.
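The addressing scheme of Section II and the error behaviour described in Sections III and IV can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: it uses floating-point arithmetic rather than the fixed-point data paths of the hardware, it handles the linear case d = 1 only, and the function names (`build_table`, `evaluate`, `max_error`) are hypothetical, not taken from the specification.

```python
import math

def build_table(f, a, n, k):
    """Linear Lagrange coefficients (a0, a1) for each of the 2^k segments
    of the evaluation interval [a, a + 2^n)."""
    dx = 2.0 ** (n - k)                      # segment length
    table = []
    for i in range(2 ** k):
        x0 = a + i * dx
        a0 = f(x0)                           # value at the segment start
        a1 = (f(x0 + dx) - a0) / dx          # slope through both end points
        table.append((a0, a1))
    return table

def evaluate(table, xi, n, m, k):
    """Evaluate the approximation; xi is the m-bit integer holding
    x_I = x - a, scaled by 2^(m - n)."""
    i = xi >> (m - k)                                    # k address bits
    t = (xi & ((1 << (m - k)) - 1)) * 2.0 ** (n - m)     # m - k interpolating bits
    a0, a1 = table[i]
    return a0 + a1 * t

def max_error(f, a, n, m, k):
    """Maximum absolute error over all 2^m representable arguments."""
    table = build_table(f, a, n, k)
    return max(abs(f(a + xi * 2.0 ** (n - m)) - evaluate(table, xi, n, m, k))
               for xi in range(2 ** m))

if __name__ == "__main__":
    for k in range(2, 7):
        print(k, max_error(math.log2, 1.0, 0, 12, k))
```

For log2 x on [1, 2), the printed errors should shrink by roughly a factor of four per added address bit, which is the d + 1 = 2 bits of precision per address bit stated in the text for linear approximation.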
Step 1. Determine q_v and q_u, respectively the lsb and msb positions of
f(x) = g_i(t). Assume that the design requirements for an interpolating
memory are given in terms of either the specification of the least
significant bit position q_v of the result, or else a bound E on the
absolute error. Then

    E \le 2^{q_v}.        (10)

Note that if the magnitude of f(x) is to be specified in bits q_u to
q_v, where q_u ≥ q_v, then

    \max | f(x) | \le 2^{q_u + 1} - 2^{q_v},

so that

    q_u = \lceil \log_2 ( \max | f(x) | + 2^{q_v} ) \rceil - 1.        (12)

Step 2. Determine d, the degree of g_i(t), and k, the number of address
bits for the memory of coefficients. Since various designs are possible
through tradeoffs of d and k, one parameter must be specified
independently. A range of designs can be evaluated as d is varied.

For any function with continuous derivatives, the characteristic of
precision as a function of k has the form shown in FIG. 3. For each of
these curves, the straight-line asymptote is also a bound on the error.
This bound is

    \log_2 E_k \le \log_2 E_{OL} - (d+1)\,k.        (13)

The E_OL value is determined by evaluating the maximum E_k for any value
of k larger than all design possibilities, and projecting a line of
slope -(d + 1) through the point obtained, to k = 0. Alternatively, for
Lagrange interpolating polynomials, the bound may be obtained by taking
the logarithm of (9).

Since the result will include roundoff error of one half lsb, the error
of the approximation must be limited to half the value of bit q_v. Then,
to ensure log2 E_k ≤ q_v - 1, set

    \log_2 E_{OL} - (d+1)\,k \le q_v - 1        (14)

and solve for the minimum integer value of either d or k.

Step 3. Determine r_v, the lsb position of all terms a_h t^h, for
h = 0, 1, ..., d. An error of half the lsb value may result from
rounding the sum of the polynomial terms. Therefore, the error of the
sum must not be greater than 2^(q_v - 1), and thus the error of each of
the d + 1 terms must not be greater than 2^(q_v - 1)/(d + 1). Then,

    2^{r_v} \le \frac{2^{q_v - 1}}{d + 1}        (15)

from which the largest value of r_v is determined:

    r_v = q_v - 1 - \lceil \log_2 (d + 1) \rceil.        (16)

For h = 0, steps 5 and 6 are performed. Steps 4-8 are performed for
h = 1, ..., d.

Step 4. Determine s_u, the msb position of t^h. Since t is specified in
bits n - k - 1 to n - m, t^h will be specified in bits h(n - k) - 1 to
h(n - m). Therefore,

    s_u = h\,(n - k) - 1.        (17)

Note that the lowest order bit of t^h that must be implemented is bit
h(n - m). Hence,

    s_v \ge h\,(n - m).        (18)

Step 5. Determine w_v, the lsb of a_h. For h = 0, w_v = r_v. For h > 0,
the effect of the multiplication a_h t^h must be considered. In the
Appendix, it is shown that the precision of a product is determined by
the following rule. If A and B are unsigned numbers specified in bit
positions a_u to a_v and b_u to b_v, respectively, with errors not
greater than half the lsb value, then C = AB may be specified in bits
c_u to c_v, where c_u = a_u + b_u + 1, and

    c_v \ge a_v + b_u + 2,        (19a)

    c_v \ge a_u + b_v + 2.        (19b)

Then the error of C is not greater than the lsb value.

In the current case, it is assumed that the only error of a_h and the
argument x is the roundoff error, which is bounded by half the lsb
value. Then, applying the rule (19a) to the product a_h t^h,

    r_v \ge w_v + s_u + 2,        (20)

which leads to the largest lsb

    w_v = r_v - s_u - 2.        (21)

Step 6. Determine w_u, the msb of a_h. Following the same reasoning as
in (12),

    w_u = \lceil \log_2 ( \max | a_h(i) | + 2^{w_v} ) \rceil - 1.        (22)

Step 7. Determine s_v, the lsb of t^h. Applying the rule (19b) to
a_h t^h,

    r_v \ge w_u + s_v + 2,        (23)

from which the largest possible s_v value is determined to be
r_v - w_u - 2. But as noted in Step 4, bits of t^h that are of lower
order than bit h(n - m) are identically zero. Therefore,

    s_v = \max ( r_v - w_u - 2,\; h\,(n - m) ).        (24)

Step 8. Determine r_u, the msb position of a_h t^h. In the Appendix, it
is shown that as long as the product is rounded to r_v, satisfying (21)
and (23), then

    r_u = s_u + w_u + 1.        (25)

Two particular examples of interpolating memory design are presented in
the next section.

VI. EXAMPLES OF DESIGN

Consider the evaluation of log2(x) for 1 ≤ x < 2. In this case Δx = 1,
and n = 0. Assume x is specified to 14-bit fixed-point precision
(m = 14), and log2(x) is also to be given
to 14-bit precision. Suppose the approximation is by means of linear
segments, fitted by the least-squares criterion. The hardware
requirements are computed by the algorithm of the previous section.

Step 1. We assume q_v = -14 is a design requirement. The error of f(x)
is bounded by E ≤ 2^(-14). Then (12) yields q_u = 0.

Step 2. We consider linear (d = 1) approximating polynomials. The bound
E_OL = 0.084 is obtained from the linear, least-squares error
characteristic (similar to FIG. 3) for the log function. Then (14)
yields k ≥ 5.71. The smallest possible memory for 14-bit precision has
k = 6, or 64 addresses.
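The arithmetic of Step 2 can be reproduced in a few lines. This is a sketch only: it assumes that (14) reads log2 E_OL - (d + 1)k ≤ q_v - 1, and the helper name `min_address_bits` is hypothetical.

```python
import math

def min_address_bits(log2_e_ol, d, q_v):
    """Smallest integer k satisfying log2(E_OL) - (d + 1) * k <= q_v - 1."""
    return math.ceil((log2_e_ol - (q_v - 1)) / (d + 1))

if __name__ == "__main__":
    # Values quoted in the text: E_OL = 0.084, d = 1, q_v = -14.
    k = min_address_bits(math.log2(0.084), d=1, q_v=-14)
    print(k, 2 ** k)    # address bits and number of memory words
```

With the quoted values the unrounded bound is about 5.71, so the helper returns k = 6, that is, a 64-word memory.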

Step 3. For the d = 1 case, (16) yields r_v = -16.

Step 5 (h = 0). Set w_v = r_v = -16.

Step 6 (h = 0). Since there are 2^6 segments for the proposed design,
the last segment, which yields the largest value of a_0, will begin at
x = 2 - 2^(-6). Then max a_0 = log2(2 - 2^(-6)) = 0.977. Then (22)
yields w_u = -1.

Step 4 (h = 1). Equation (17) yields s_u = -7. Also, s_v ≥ -14.

Step 5 (h = 1). By (21), w_v = -11.

Step 6 (h = 1). By (22), w_u = 0.

Step 7 (h = 1). By (24), s_v = -14.

Step 8. By (25), r_u = -6.

Thus, a 64-word memory is required. The word width is 28 bits: 16 bits
for the a_0 values, and 12 bits for the a_1 values. The multiply unit is
12 x 8 bits.
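Steps 3 through 8 of the design algorithm lend themselves to a short script. The sketch below assumes the equations read as follows: (16) r_v = q_v - 1 - ceil(log2(d + 1)), (17) s_u = h(n - k) - 1, (21) w_v = r_v - s_u - 2, (22) w_u = ceil(log2(max|a_h| + 2^w_v)) - 1, (24) s_v = max(r_v - w_u - 2, h(n - m)), and (25) r_u = s_u + w_u + 1. The coefficient magnitude used for a_1 (1/ln 2, the slope of log2 x at x = 1) is an assumption, since the text does not quote it; the function names are hypothetical.

```python
import math

def msb_position(max_magnitude, lsb):
    """Smallest msb u with max_magnitude <= 2^(u+1) - 2^lsb, as in (12)/(22)."""
    return math.ceil(math.log2(max_magnitude + 2.0 ** lsb)) - 1

def width(msb, lsb):
    """Number of bits from position msb down to position lsb, inclusive."""
    return msb - lsb + 1

def design_parameters(n, m, k, d, q_v, max_coeff):
    """Return r_v and one row of bit positions per polynomial term h."""
    r_v = q_v - 1 - math.ceil(math.log2(d + 1))            # (16)
    rows = []
    for h in range(d + 1):
        if h == 0:
            w_v = r_v                                       # Step 5, h = 0
            w_u = msb_position(max_coeff[0], w_v)           # Step 6, h = 0
            rows.append({"h": 0, "w_u": w_u, "w_v": w_v})
            continue
        s_u = h * (n - k) - 1                               # (17)
        w_v = r_v - s_u - 2                                 # (21)
        w_u = msb_position(max_coeff[h], w_v)               # (22)
        s_v = max(r_v - w_u - 2, h * (n - m))               # (24)
        r_u = s_u + w_u + 1                                 # (25)
        rows.append({"h": h, "w_u": w_u, "w_v": w_v,
                     "s_u": s_u, "s_v": s_v, "r_u": r_u})
    return r_v, rows

if __name__ == "__main__":
    # log2(x) example: n = 0, m = 14, k = 6, d = 1, q_v = -14.
    # max a_0 = 0.977 is quoted in the text; max a_1 = 1/ln 2 is assumed.
    r_v, rows = design_parameters(0, 14, 6, 1, -14, [0.977, 1 / math.log(2)])
    word = sum(width(r["w_u"], r["w_v"]) for r in rows)
    print(r_v, rows, word)
```

Under these assumptions the script gives a 28-bit word (16 + 12 bits) and a 12 x 8 multiplier, matching the figures stated for the example.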

Example of Higher Degree Interpolation

In this example, we consider a range of design possibilities for the
computation of sin(x) to a fixed-point precision of 48 bits. Since the
argument x must be evaluated between 0 and π/2, a value of Δx = 2 is
chosen, and n = 1. Assume x is specified to 48-bit precision (m = 48).

Steps 1 and 2 of the design algorithm yield q_v = -48 (the design
requirement), |E| ≤ 2^(-48), and q_u = 0. The algorithm is applied to
evaluate the design choices d = 2, 3, and 4.
The least-squares error characteristic for sin(x), for d = 2, yields
log2 E_OL = -4.65. Then k = 15 is obtained in Step 3. A memory unit of
32K words is required. Applying the remainder of the algorithm yields
the design parameters shown in Table I. The memory word width is 113
bits, for a total memory requirement of 3616K bits. Two multipliers are
required: 39 x 33 bits and 23 x 23 bits.

For d = 3, the least-squares error characteristic shows
log2 E_OL = -7.75. Then Step 3 yields k = 11. A memory unit with 2K
words is required. Table II gives the remainder of the design
parameters. The memory word length is seen to be 146 bits, for a total
memory requirement of 292K bits. Three multipliers are required:
43 x 38 bits, 32 x 32 bits, and 20 x 20 bits.

For d = 4, the error characteristic shows log2 E_OL = -10.75, from which
k = 8 is determined in Step 3. A 256-word memory is required. The
remaining design parameters are given in Table III. The word width is
seen to be 187 bits, for a total memory requirement of 47K bits. Four
multipliers are required: 46 x 40 bits, 38 x 38 bits, 30 x 30 bits, and
21 x 21 bits.

VII. COMPARISONS TO OTHER METHODS

The interpolating memory is a generalization of the table-lookup and the
linear and quadratic approximation methods of [1]-[5]. These works
describe limited-precision and restricted (one function only) versions
of the present technique. Iterative techniques [10], [11] require an
order of magnitude greater computation time, and so are not directly
comparable. The difference-grouping PLA (DGPLA) of Lo and Aoki is a
noniterative function evaluation unit, capable of evaluating the log
function. The timing and hardware requirements of the interpolating
memory and the DGPLA will be compared.

The time required for a function evaluation by an interpolating memory
may be observed in FIG. 1. The generation of the required powers of t is
overlapped with the memory access for the polynomial coefficients. All
of the multiplications are performed in parallel. With array
multipliers, the time for m-bit multiplication is bounded by the delay
of 2m full adders [14]; but in this case the widest multiplier is even
less than m bits. The time for the final addition is largely overlapped
with the multiplication time. Thus, the time for a function evaluation
is the sum of the access time of a memory of k address bits, followed by
2m full adder stages.

The time requirement of the DGPLA is of a similar order. The DGPLA
consists of a PLA followed by one addition stage. The PLA has only a
three-gate delay. However, use of the PLA limits the precision of the
unit to the fan-in of one gate. For higher precision designs, a ROM,
with m address bits, is required. For the 14-bit implementation
discussed below, the interpolating memory access time is estimated to be
about twice as great as that of the DGPLA. The timing estimates are made
without the use of additional speedup logic, such as carry-lookaheads
(from which both devices could benefit) and redundant encodings for the
multiplications.

Comparison of the hardware requirements of the interpolating memory and
the DGPLA is problematic because the interpolating memory scheme admits
of a variety of implementations, with tradeoffs of logic and memory. And
the measure of memory for a PLA is qualitatively different from that of
a RAM or ROM, even though both are expressed in bits. With these
reservations, the interpolating memory is compared to the DGPLA, for the
14-bit logarithm implementation of Section VI. The interpolating memory
requires 1.8K bits of memory. In addition, it requires a 12 x 8 bit
multiplier, and a 16 bit adder, a total of 112 full adders.

Using the technique for measuring the PLA size given in [9], the DGPLA
requires 110K programmable PLA bits for a 14-bit precision
implementation of the log function. However, if the error terms are
produced by a ROM, the unit would require 2^14 addresses, and have a
word size of 11 bits, for a total memory of 176K bits. The final stage
requires 14 full adders.
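The memory and adder totals quoted in Sections VI and VII follow from the word counts and word widths given in the text, and can be checked with the short sketch below. Two assumptions are made here and are not stated in the specification: "K" is taken to mean 1024, and an a x b array multiplier is counted as a * b full adders, which is the convention that reproduces the stated total of 112. The helper names are hypothetical.

```python
def memory_bits(address_bits, word_width):
    """Total storage for a memory of 2^address_bits words."""
    return (2 ** address_bits) * word_width

def array_multiplier_full_adders(a_bits, b_bits):
    """Assumed cost model: one full adder per partial-product bit."""
    return a_bits * b_bits

if __name__ == "__main__":
    # 14-bit log2 design: 64 words of 28 bits, a 12 x 8 multiplier, a 16-bit adder.
    print(memory_bits(6, 28) / 1024)                   # about 1.8K bits
    print(array_multiplier_full_adders(12, 8) + 16)    # full adders in total
    # DGPLA with ROM-generated error terms: 2^14 words of 11 bits.
    print(memory_bits(14, 11) // 1024)                 # K bits
    # sin(x) designs for d = 2, 3, 4: (address bits, word width).
    for k, w in [(15, 113), (11, 146), (8, 187)]:
        print(k, w, memory_bits(k, w) / 1024)          # K bits
```

The comparison is only as meaningful as the cost model: as the text notes, PLA bits and RAM or ROM bits are not directly equivalent.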

VIII. CONCLUSIONS

It has been shown that the precision of a degree-d polynomial
approximation of any function that has d + 1 derivatives increases
linearly with the number of bits used to address the interpolating
memory. Then it is possible to design an interpolating memory for any
specified precision with a variety of tradeoffs of memory and arithmetic
logic (i.e., polynomial degree).
Suppose the argument is given to m-bit precision. Possible designs range
from the case of k = m, which is simply a lookup table of size 2^m, to
k = 0, in which the function is computed from a single set of polynomial
coefficients.

Because the slope of the precision characteristic is the same for any
function with the required differentiability, a single unit may be
designed to serve for a wide class of functions. The memory unit could
be implemented in RAM and the contents switched for different functions.
Or sections of a single unit, selected by one or more address bits,
could be used for different functions. A RAM-based unit could be a
central computing element in a multifunction arithmetic pipeline
organization.

The only differences in the hardware requirements for the various
functions are in the magnitudes of the function arguments and the
polynomial coefficients. These quantities do not differ by more than a
few bits for the most common functions. The procedure for the
generalized design is simply to apply the design algorithm for each
desired function independently, and then select for each quantity the
maximum of the msb positions and the minimum of the lsb positions.

APPENDIX

THE FIXED-POINT PRECISION OF MULTIPLICATION

Let the real numbers A and B have the finite-precision signed-magnitude
binary representations Â and B̂, in which bits a_u and b_u, respectively,
are the msb's, and bits a_v and b_v, respectively, are the lsb's. Let
e_a and e_b be the respective errors of the representations:
A = Â + e_a and B = B̂ + e_b.

Consider the product

    C = AB = \hat{A}\hat{B} + e_{ab}        (A1)

where

    e_{ab} = \hat{A}\, e_b + \hat{B}\, e_a + e_a e_b        (A2)

is the error of the finite-precision product ÂB̂. Note that the largest
nonzero bit of ÂB̂ is in a position no greater than a_u + b_u + 1, and
the smallest nonzero bit in a position no less than a_v + b_v.

Let Ĉ, with msb c_u and lsb c_v, represent C. Let e_c be the error of
the representation: C = Ĉ + e_c. Ĉ is obtained by rounding ÂB̂ to bit
position c_v. The error e_r introduced by rounding is defined by

    \hat{A}\hat{B} = \hat{C} + e_r.        (A3)

Then from (A1) and (A3) it is seen that

    | e_c | \le | e_{ab} | + | e_r |.        (A4)

But max |e_r| = 2^(c_v - 1). Then if

    \max | e_{ab} | < 2^{c_v - 1},        (A5)

the result Ĉ can be implemented to lsb c_v, with |e_c| < 2^(c_v).

Consider the case in which initial representations of A and B can be
obtained with high precision and negligible error. Then the errors of
the representations Â and B̂ are rounding errors only:
|e_a| ≤ 2^(a_v - 1) and |e_b| ≤ 2^(b_v - 1). The maximum absolute values
of Â and B̂ are

    \max | \hat{A} | = 2^{a_u + 1} - 2^{a_v}        (A6a)

and

    \max | \hat{B} | = 2^{b_u + 1} - 2^{b_v}.        (A6b)

Substituting these maximum values into (A2),

    | e_{ab} | = 2^{a_u + b_v} + 2^{a_v + b_u} - 2^{a_v + b_v} + 2^{a_v + b_v - 2},        (A7)

from which

    | e_{ab} | < 2^{\max(a_u + b_v,\; a_v + b_u) + 1}.        (A8)

Replacing max |e_ab| in (A5) with the bound in (A8), it is seen that Ĉ
may be implemented to bit

    c_v \ge \max(a_u + b_v,\; a_v + b_u) + 2,        (A9)

with |e_c| < 2^(c_v).

As another case, suppose Â and B̂ are obtained as a result of a process
that has some error. Then the bounds |e_a| < 2^(a_v) and
|e_b| < 2^(b_v) are a reasonable assumption. (In particular, this is the
bound assumed for the result of the multiplication.) Reasoning as in
(A6)-(A9) for this case yields

    c_v \ge \max(a_u + b_v,\; a_v + b_u) + 3.        (A10)

To specify c_u, one must determine whether the rounding will carry
through bit a_u + b_u + 1. Let n(c_v) be the smallest number represented
in bit positions a_u + b_u + 1 to a_v + b_v, such that when rounded to
bit position c_v, a carry out of bit position a_u + b_u + 1 will result.
Clearly, n(c_v) consists of ones from bit position a_u + b_u + 1 to bit
position c_v - 1, and zeros in all lower bit positions:

    n(c_v) = 2^{a_u + b_u + 2} - 2^{c_v - 1}.

If ÂB̂ is rounded to bit position c_v, carry overflow will not take place
as long as

    \max | \hat{A}\hat{B} | < n(c_v).        (A11)

Using the maximum values in (A6), this condition becomes

    2^{a_u + b_v + 1} + 2^{a_v + b_u + 1} - 2^{a_v + b_v} > 2^{c_v - 1}.        (A12)

With a_u ≥ a_v and b_u ≥ b_v, (A12) holds if

    c_v \le \max(a_u + b_v,\; a_v + b_u) + 2.        (A13)

Therefore, for the case in which the errors of A and B are bounded by
half their lsb values, and Ĉ is rounded to bit c_v given by (A9),
overflow from rounding does not take place, and c_u = a_u + b_u + 1. For
the case in which the errors of A and B are bounded by their respective
lsb's, and Ĉ is ÂB̂ rounded as given by (A10), overflow from rounding may
take place, and c_u = a_u + b_u + 2.

While the invention has been described with reference to specific
embodiments, modifications and variations may be constructed without
departing from the scope of the invention, which is defined in the
following claims.
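The product-precision rule of the Appendix can be spot-checked numerically. The sketch below is an illustration only, under these assumptions: the rule is read as c_v = max(a_u + b_v, a_v + b_u) + 2 with c_u = a_u + b_u + 1 for operands rounded to half an lsb; only unsigned operands are tested; rounding is round-half-up; and the helper names are hypothetical. Exact rational arithmetic (`fractions.Fraction`) is used so that the check is not disturbed by floating-point error.

```python
import math
import random
from fractions import Fraction

def pow2(e):
    """Exact power of two for any integer exponent."""
    return Fraction(2) ** e

def round_to_bit(x, lsb):
    """Round a non-negative Fraction to a multiple of 2^lsb (round half up)."""
    step = pow2(lsb)
    return math.floor(x / step + Fraction(1, 2)) * step

def check_product_rule(a_u, a_v, b_u, b_v, trials=2000, seed=0):
    """Return True if every sampled product obeys the error and msb bounds."""
    rng = random.Random(seed)
    c_v = max(a_u + b_v, a_v + b_u) + 2        # lsb of the rounded product
    c_u = a_u + b_u + 1                        # claimed msb of the product
    max_a = pow2(a_u + 1) - pow2(a_v)          # largest representable A
    max_b = pow2(b_u + 1) - pow2(b_v)          # largest representable B
    for _ in range(trials):
        a = Fraction(rng.random()) * max_a     # "exact" operands
        b = Fraction(rng.random()) * max_b
        a_hat = round_to_bit(a, a_v)           # error at most half an lsb
        b_hat = round_to_bit(b, b_v)
        c_hat = round_to_bit(a_hat * b_hat, c_v)
        if abs(a * b - c_hat) >= pow2(c_v):    # error must stay below one lsb
            return False
        if c_hat >= pow2(c_u + 1):             # no carry beyond bit c_u
            return False
    return True

if __name__ == "__main__":
    print(check_product_rule(3, 0, 2, -1))
    print(check_product_rule(0, -11, -7, -14))   # a_1 and t of the log2 example
```

A random check of this kind does not prove the rule; it only confirms that no counterexample appears among the sampled operands.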
TABLE I

Parameters for d = 2

  h    w_u    w_v    s_u    s_v    r_u    r_v
  0    -1     -51
  1     0     -38    -15    -47    -14    -51
  2    -2     -24    -29    -51    -30    -51



TABLE II

Parameters for d = 3

  h    w_u    w_v    s_u    s_v    r_u    r_v
  0    -1     -51
  1     0     -42    -11    -48    -10    -51
  2    -1     -32    -21    -52    -21    -51
  3    -3     -22    -31    -50    -33    -51

TABLE III

Parameters for d = 4

  h    w_u    w_v    s_u    s_v    r_u    r_v
  0    -1     -52
  1    -1     -46     -8    -47     -8    -52
  2    -2     -39    -15    -52    -16    -52
  3    -3     -32    -22    -51    -24    -52
  4    -5     -25    -29    -49    -33    -52
I claim:

1. An interpolating memory function evaluation apparatus, comprising: a
memory address means for accessing a memory unit by using first bits of
a function argument for the memory unit address, an approximating means
for obtaining parameters of an approximating function from the memory
unit, and an evaluating means for evaluating the approximating function,
using other bits of the function argument, the specific approximating
function whose parameters were obtained from the memory unit, wherein
the accessing comprises using k bits of the function argument to address
the memory unit, wherein the memory unit contains polynomial
coefficients, wherein the evaluating comprises evaluating with fixed
combinational logic a degree-d polynomial whose coefficients are
obtained from the memory unit, using as the polynomial argument the bits
of the function argument of lower order than the k bits used to address
the memory unit, wherein the evaluation of the polynomial is performed
using as an input to a polynomial evaluator a value t comprised of bits
of the function argument of lower order than the k bits used to address
the memory unit, computing the quantities t^2, t^3, . . . , t^d in fixed
combinational logic, computing terms of the polynomial in a set of
multipliers operating in parallel, whose inputs are the coefficients
obtained from the memory unit and the powers of t, and adding terms of
the polynomial.

2. The apparatus of claim 1, further comprising accepting a
floating-point function argument, with exponent field E and mantissa M,
and using k function argument bits, including zero or more low order
bits of E and zero or more high order bits of M as an address of a
register set in a set-associative memory, performing an associative
search in the associative memory set selected by the address, using the
high order bits of E as the associative search key, obtaining from the
memory location associated with a successful match in the associative
memory, a floating-point exponent of a function value and the d + 1
polynomial coefficients, evaluating the polynomial specified by the
coefficients with polynomial argument specified as the low-order bits of
M, and normalizing the floating-point value consisting of the exponent
obtained from the memory and the mantissa obtained as a result of the
polynomial evaluation.

3. The apparatus of claim 2, wherein a particular function to be
evaluated is supplied as a function identification f, and the
associative search key is comprised of both f and the high order bits of
E.

4. An apparatus for evaluating floating-point func- 
tions, comprising a first input for inputting a function 
identification f, a second input for inputting an argu- 
ment X in floating-point form with exponent E and man- 
tissa M, means for supplying the function identification 
f input to a set-associative memory, means for supplying 
high order bits of E to the set-associative memory, 
means for supplying low order bits of E and high order 
bits of M to a set selection input of the associative mem- 
ory, for selecting a portion of the associative memory, 
means for making an associative search in the selected 
portion of the memory with f and the high order bits of 
E, means for selecting, as a result of the associative 
search, a memory word containing polynomial coeffici- 
ents and a floating-point exponent, means for supplying 
parallel outputs of the polynomial coefficients to a poly- 
nomial evaluation unit, means for supplying low order 
bits of M to the polynomial evaluation unit, means for 
supplying the floating-point exponent and an output of 
the polynomial evaluation unit to a floating-point nor- 
malization unit in obtaining floating-point representa- 
tion of the function evaluation from the normalization 
unit M of the normalizer. 

5. A floating-point interpolating memory, comprising 
a floating-point argument input having an exponent 
section and a mantissa section, a function identification 
input; a set-associative memory having a function input 
and an exponent input, the function input being connected 
to the function identification input, high order bits in the 
exponent section being connected to the exponent input of 
the set-associative memory, a set-selector connected to the 
set-associative memory, at least one of low order bits from 
an exponent register and high order bits from a mantissa 
register being connected to the set selector for selecting 
an associative memory set from the set-associative memory, 
a word memory unit connected to the set-associative memory 
having words selected by an associative search in the 
selected set, the selected memory words including a 
floating-point exponent and polynomial coefficients, a 
polynomial evaluator connected to plural outputs of the 
word memory unit holding the polynomial coefficients, 
the polynomial evaluator having an input connected to 
low order bits from the input mantissa register, a 
floating-point normalization unit connected to an output of 
the polynomial evaluator and having an input connected to 
a floating-point exponent obtained from the word memory 
unit, the floating-point normalization unit having an 
output of the function evaluation. 
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ABSTRACT 



A method optimizes function evaluations performed by a 
VLIW processor through enhanced parallelism by evaluat- 
ing the function by table approximation using decomposi- 
tion into a Taylor series. 

2 Claims, 2 Drawing Sheets 
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COMPUTER SYSTEM AND METHOD FOR 
PARALLEL COMPUTATIONS USING TABLE 
APPROXIMATION METHODS 

This application claims benefit of Provisional Appln. No. 
60/068,738 filed Dec. 24, 1997. 

FIELD OF THE INVENTION 

The present invention relates to processors and computing 
devices and more particularly to compilers for optimized 
multiple function arithmetic execution units in a processor. 

BACKGROUND OF THE INVENTION 

Many practical applications require processing of very 
large amounts of information in a short period of time. 
Examples include weather forecasting, the design and 
modeling of complex dynamic systems and others, which 
applications frequently involve repeated estimation of 
modeling functions over a set of input parameters. 

One of the basic approaches to minimizing the time to 
perform such computations is to apply some sort of 
parallelism, so that tasks which are logically independent 
can be performed in parallel. This can be done, for example, 
by executing two or more instructions per machine cycle, 
i.e., by means of instruction-level parallelism. Thus, in a 
class of computers using superscalar processing, hardware is 
used to detect independent instructions and execute them in 
parallel, often using techniques developed in the early 
supercomputers. 

Another more powerful approach to exploiting instruction 
level parallelism is used by the Very Long Instruction Word 
(VLIW) processor architectures in which the compiler 
performs most instruction scheduling and parallel dispatching 
at compile time, reducing the operating burden at run time. 
By moving the scheduling tasks to the compiler, a VLIW 
processor avoids both the operating latency problems and 
the large and complex circuitry associated with on-chip 
instruction scheduling logic. 

As known, each VLIW instruction includes multiple 
independent operations for execution by the processor in a 
single cycle. A VLIW compiler processes these instructions 
according to precise conformance to the structure of the 
processor, including the number and type of the execution 
units, as well as execution unit timing and latencies. The 
compiler groups the operations into a wide instruction for 
execution in one cycle. At run time, the wide instruction is 
applied to the various execution units with little decoding. 

The execution units in a VLIW processor typically include 
arithmetic units such as floating point arithmetic units. An 
example of a VLIW processor that includes floating point 
execution units is described by R. K. Montoye et al. in 
"Design of the IBM RISC System/6000 floating point 
execution unit", IBM J. Res. Develop., V. 43 No. 1, pp. 
61-62, January 1990. Additional examples are provided in 
U.S. Pat. No. 5,418,975, as well as pending patent 
application Ser. Nos. 08/733,480, 08/733,479, 08/733,833, 
08/733,834, 08/733,831 and 08/733,832, the content of which 
is incorporated herein for all purposes. 

While these processors are capable of performing a variety 
of tasks adequately, it is perceived that the performance 
of VLIW processors can be improved further by optimizing 
them with respect to certain specialized but highly 
repetitive tasks that are often used in practice, such as 
function evaluation using decomposition into Taylor series. 

SUMMARY OF THE INVENTION 

A novel method and system is presented for use with a 
VLIW processor to optimize it for use in function 
evaluation. In accordance with a preferred embodiment of the 
present invention, a novel approach is presented to 
enhancing parallelism in the evaluation of functions by table 
approximation methods using decompositions into Taylor 
series. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates in block diagram form the architecture 
of a VLIW processor that can be used in a preferred 
embodiment of the present invention. 

FIG. 2 illustrates the steps in accordance with a preferred 
embodiment of the method of the present invention. 

DETAILED DESCRIPTION 

FIG. 1 illustrates in block diagram form the architecture 
of a VLIW processor that can be used in a specific 
embodiment of the present invention. The processor generally 
comprises an instruction fetch unit 10; execution unit 90; 
register file 30, the content of which is read in file read unit 
20; instruction cache 40 and data cache 50; and state update 
unit 80. 

Generally, instruction fetch unit 10 acquires active 
instructions via the I-cache 40. Execution unit 90 comprises 
a set of function units 60. Example function units are integer 
arithmetic logic units (ALU), floating point addition and 
multiplication. Also included are data access operation units 
70. Units may be pipelined into stages. Once instructions are 
completed, their result is written in the state update unit 80 
that writes back results in the register file 30. 

The general architecture of a VLIW processor will not be 
discussed in further detail. Interested readers are directed to 
U.S. Pat. No. 5,418,975, and pending patent application Ser. 
Nos. 08/733,480, 08/733,479, 08/733,833, 08/733,834, 
08/733,831 and 08/733,832, the disclosures of which are 
incorporated by reference herein. As known in the art, in a 
VLIW architecture, the very long instruction words present 
the scripts for the function units to follow at execution time. 
The level of parallelism desired in a particular application is 
achieved using local and global scheduling that enables 
optimum distribution of the workload among different 
functional units. 

In accordance with a preferred embodiment of the present 
invention, an improvement in processing speed in the 
evaluation of certain functions is achieved by representing 
the function as an expansion around one or more 
argument values, preferably stored in a table, and 
providing a fast parallel method of computing the expansion 
for a deviation from the stored value of the argument. 

More specifically, in accordance with the present 
invention, algorithms are provided for the fast 
computation of functions, such as sqrt(x), cbrt(x) and ln(x), 
by table approximation methods using decomposition into 
Taylor series. The method of the present invention is 
illustrated next in the example of fast parallel sqrt(x) 
function computation. 

With reference to FIG. 2, the first step of the method in a 
preferred embodiment is to divide the range of argument 
values for approximation into n intervals. In many 
practical applications this range can be assumed as 0.5<x<1. 

Next, for each of the n intervals, the value of the function 
at the center x0 of the interval is determined. For notational 
simplicity, the index "i" of the interval is omitted. Thus, in 
a preferred embodiment of the present invention, at run time 
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the function for any argument falling within the i-th interval is 
evaluated as an approximation of the function using series 
expansion about the center x0 of the interval. The deviation 
of the actual function argument from the x0 value is denoted 
dx. 

Next, to compute, for example, the sqrt(x) function, in 
accordance with the present invention the following expres- 
sion is used: 

sqrt(x) = sqrt(x0) + sqrt(x0)/x0^m * (a0*dx*x0^(m-1) + a1*dx^2*x0^(m-2) 
+ . . . + a(m-2)*dx^(m-1)*x0 + a(m-1)*dx^m); (Eqn. 1) 

The values of sqrt(x0) and sqrt(x0)/x0^m are computed 
and stored in a table. The coefficients a0, a1, . . . , a(m-1) are 
obtained from the function decomposition into Taylor series 
and are similarly stored in memory. 
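
The table lookup and Eqn. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the function names (sqrt_taylor_coeffs, build_table, sqrt_approx), the default range 0.5 to 1, and the choice of equal-width intervals are all assumptions made for the example. The coefficients a0, a1, . . . are taken to be the binomial-series coefficients of (1+u)^(1/2), which is what the Taylor expansion of sqrt about x0 gives when written in the form of Eqn. 1.

```python
import math

def sqrt_taylor_coeffs(m):
    """Return a0..a(m-1): Taylor coefficients of (1+u)**0.5 for u**1..u**m."""
    coeffs = []
    c = 1.0  # generalized binomial coefficient C(1/2, 0)
    for k in range(1, m + 1):
        c *= (0.5 - (k - 1)) / k  # C(1/2, k) from C(1/2, k-1)
        coeffs.append(c)
    return coeffs

def build_table(n, m, lo=0.5, hi=1.0):
    """Precompute (x0, sqrt(x0), sqrt(x0)/x0**m) for n equal-width intervals."""
    width = (hi - lo) / n
    table = []
    for i in range(n):
        x0 = lo + (i + 0.5) * width  # center of the i-th interval
        s = math.sqrt(x0)
        table.append((x0, s, s / x0**m))
    return table

def sqrt_approx(x, table, coeffs, lo=0.5, hi=1.0):
    """Evaluate Eqn. 1 for an argument x in [lo, hi]."""
    n = len(table)
    m = len(coeffs)
    i = min(int((x - lo) / (hi - lo) * n), n - 1)  # interval index
    x0, s, scale = table[i]
    dx = x - x0  # deviation from the interval center
    # Each summand a(k) * dx**(k+1) * x0**(m-1-k) is independent of the others.
    poly = sum(coeffs[k] * dx**(k + 1) * x0**(m - 1 - k) for k in range(m))
    return s + scale * poly
```

With, say, n=16 intervals and m=4 terms, sqrt_approx(x, build_table(16, 4), sqrt_taylor_coeffs(4)) agrees with math.sqrt(x) to better than 1e-8 over the range. The summands inside the sum do not depend on one another, which is the property the parallel schemes discussed in the text rely on.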

The remaining part of Eqn. 1 is a polynomial of the form 

which can be computed conveniently with the use of dif- 
ferent parallel computation schemes, as known in the art. 
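
As one example of a parallel computation scheme known in the art, the sketch below uses Estrin's scheme. The text does not say which scheme is intended, so the choice of Estrin's scheme and the function name estrin are assumptions made for illustration. The polynomial is taken in the generic form c0 + c1*x + c2*x^2 + . . . ; for Eqn. 1 the argument would be dx and the coefficients would combine a(k) with the stored powers of x0.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) using Estrin's scheme."""
    terms = list(coeffs)
    if not terms:
        return 0.0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so the terms can be paired
        # Every pair (one multiply, one add) is independent of the other
        # pairs, so on a machine with several function units the pairs of
        # one level could be issued together.
        terms = [terms[j] + terms[j + 1] * power
                 for j in range(0, len(terms), 2)]
        power = power * power  # x, x**2, x**4, ...
    return terms[0]
```

Each pass of the loop halves the number of terms, so the dependency chain grows with the logarithm of the polynomial degree rather than linearly as in Horner's rule, at the cost of a few extra multiplications for the repeated squaring.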

The following example illustrates a parallel computation 
scheme for the cbrt function: 
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where the total number of required arithmetic operations 
is K=29, and the length of the critical path for the computation 
of the function evaluation is T=max(5mul+2add, 4mul+ 
4add). 

It can be appreciated that formulae similar to Eqns. 1 and 
2 can easily be derived for a number of additional functions, 
such as the cubic root cbrt, and the ln functions. These 
functions lend themselves to straightforward expansion in a 
Taylor series. Once the expansion is available, the values of 
the function at the x0 point and the powers of x0, as required 
in the expansion, can be obtained and stored. The remaining 
part of the series expansion lends itself to parallel computing 
that greatly reduces the time required for the function 
evaluation. 

In accordance with a preferred embodiment of the present 
invention, the number of intervals n into which the range of 
function arguments is divided is determined by constraints 
on the size of the utilized tables of constants and the required 
accuracy. The constant m is found in a preferred embodiment 
on the basis of the size of the intervals, i.e., n, and 
the required accuracy of the computation. The accuracy of 
the computation can be determined using the expressions for 
the error in Taylor series expansions. 

Finally, in accordance with a preferred embodiment, 
reduction of the argument to the required approximation 
range and obtaining of the final result after the computations 
in the interval are performed in a traditional way. 

Although the present invention has been described in 
connection with the preferred embodiments, it is not 
intended to be limited to the specific form set forth herein, 
but on the contrary, it is intended to cover such 
modifications, alternatives, and equivalents as can be 
reasonably included within the spirit and scope of the invention 
as defined by the following claims. 

What is claimed is: 

1. A computer method for compiling function evaluation 
on a parallel computing system comprising the steps of: 

providing an execution unit having a plurality of function 
units, each function unit capable of performing one or 
more arithmetic-logic operations; 

dividing up the range of function arguments into n values, 
determining the center x0 for each interval; 

determining the value of the function at x0, the m-th 
power of x0 and the first m coefficients a(i) of the Taylor 
series expansion of the function and storing said values 
in a memory, where m is a number selected on the basis 
of the desired accuracy of the computation; 

for a given argument x positioned at a distance dx from x0, 
evaluating a polynomial of the type 



using the function units of said execution unit to 
compute summands of said polynomial in parallel; and 

combining the values stored in the memory and the 
evaluation of said polynomial as to provide an evalu- 
ation of the function at the x argument value. 

2. The method of claim 1 further comprising the steps of: 

dividing up the evaluation of a polynomial into two or 
more independent tasks; 

determining the longest independent task, defined as a 
critical path for the polynomial evaluation; 

minimizing the processing time for the critical path by 
replacing multiplication operations with addition 
operations; and 

scheduling a sequence of tasks among said plurality of 
function units, wherein completion of all tasks results 
in the polynomial evaluation. 

* * * * * 
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